Mining News Sites to Create Special Domain News Collections

Author

  • David B. Bracewell
Abstract

We present a method for creating special domain collections from news sites. The method requires only a single sample article as a seed, needs no prior corpus statistics, and is applicable to multiple languages. We examine various similarity measures and the creation of document collections for English and Japanese. The main contributions are as follows. First, the algorithm can build special domain collections from as little as one sample document. Second, unlike other algorithms, it does not require a second “general” corpus to compute statistics. Third, in our testing the algorithm outperformed others in creating collections made up of highly relevant articles.

Keywords: Information Retrieval, News, Special Domain Collections
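The abstract does not say which similarity measures were tested, so the following is only a minimal sketch of the general idea: score candidate articles against a single seed article without any statistics from a second corpus. Cosine similarity over raw term frequencies is an assumed stand-in, not the paper's measure. The function names and the `threshold` value are hypothetical. The `\w+` tokenizer assumes whitespace-delimited text, so Japanese would need a word segmenter or character n-grams instead.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text and count word tokens (raw term frequency only)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_collection(seed_text, candidates, threshold=0.3):
    """Return candidates at least `threshold` similar to the seed, best first.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    seed_vec = term_frequencies(seed_text)
    scored = [
        (cosine_similarity(seed_vec, term_frequencies(doc)), doc)
        for doc in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored if score >= threshold]


if __name__ == "__main__":
    seed = "earthquake hits coastal city rescue teams respond"
    articles = [
        "rescue teams respond after earthquake hits city",
        "stock markets rally as tech shares climb",
    ]
    for article in build_collection(seed, articles):
        print(article)
```

Because only the seed and the candidate are compared, no document-frequency weighting (such as IDF from a background corpus) is involved, which mirrors the "no second general corpus" property the abstract claims.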


Similar resources

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Beyond its own merits, text classification (TC) has become a cornerstone in many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...


The Comparative Study of White Marriage News based on Effective News Components in the Persian Section of Al-Alam, Persian BBC and Voice of America News Sites

Purpose: The aim of the present research was a comparative study of white marriage news based on effective news components in the Persian section of Al-Alam, Persian BBC and Voice of America news sites. Methodology: The study was applied in purpose and quantitative in method. The research population was the white marriage news of the Persian section of Al-Alam, P...


The need to create a media block for the convergence of overseas news networks

As a general diplomacy arm of the Islamic Republic of Iran, VoSiMa has extensive activities in international broadcasting of its radio and television programs. These programs are broadcast in different languages, such as English, French, Azeri, Arabic, and ... for regional and transnational audiences. The large volume of the organization's international activities is in the form of news and new...


Mining Large-scale Comparable Corpora from Chinese-English News Collections

In this paper, we explore a CLIR-based approach to constructing large-scale Chinese-English comparable corpora, which are valuable for translation knowledge mining. The initial source and target document sets are crawled from news websites and standardized uniformly.


PERSEUS: Personalized Multimedia News Portal

This paper describes the Perseus project, which is devoted to developing techniques and tools for creating personalized multimedia news portals. The purpose of a personalized multimedia news portal is to provide relevant information, selected from newswire sites on the Internet and augmented by video clips automatically extracted from TV broadcasts, based on the user’s preferences. To create su...




Publication date: 2008